In [1]:
import pandas as pd
import sqlite3
So far we've only talked about reading data from CSV files. That's a pretty common way to store data, but there are many others! Pandas can read from HTML, JSON, SQL, Excel (!!!), HDF5, Stata, and a few other things. In this chapter we'll talk about reading data from SQL databases.
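If you're curious about those other formats, each one gets its own pd.read_* function. Here's a quick sketch -- the file names are made up, so substitute your own data:
In [ ]:
# Each format has its own pd.read_* reader function.
# The file names here are hypothetical.
pd.read_json("data.json")    # JSON
pd.read_excel("data.xlsx")   # Excel
pd.read_html("page.html")    # HTML -- returns a list of dataframes
pd.read_stata("data.dta")    # Stata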
You can read data from a SQL database using the pd.read_sql function. read_sql will automatically convert SQL column names to DataFrame column names.
read_sql takes 2 arguments: a SELECT statement, and a database connection object. This is great because it means you can read from any kind of SQL database -- it doesn't matter if it's MySQL, SQLite, PostgreSQL, or something else.
This example reads from a SQLite database, but any other database would work the same way.
In [2]:
con = sqlite3.connect("../data/weather_2012.sqlite")
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con)
df
Out[2]:
read_sql doesn't automatically set the primary key (id) to be the index of the dataframe. You can make it do that by adding an index_col argument to read_sql.
If you've used read_csv a lot, you may have seen that it has an index_col argument as well. This one behaves the same way.
In [3]:
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con, index_col='id')
df
Out[3]:
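For comparison, here's what that looks like with read_csv. This is just a sketch -- it assumes the CSV has a 'Date/Time' column you'd want as the index:
In [ ]:
# read_csv's index_col behaves the same way as read_sql's.
# We're assuming the CSV has a 'Date/Time' column here.
df_csv = pd.read_csv('../data/weather_2012.csv', index_col='Date/Time')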
If you want your dataframe to be indexed by more than one column, you can give a list of columns to index_col:
In [4]:
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con,
index_col=['id', 'date_time'])
df
Out[4]:
Pandas has a write_frame function which creates a database table from a dataframe. Right now this only works for SQLite databases. Let's use it to move our 2012 weather data into SQL.
You'll notice that this function is in pd.io.sql. There are a ton of useful functions for reading and writing various kinds of data in pd.io, and it's worth spending some time exploring them. (see the documentation!)
In [5]:
weather_df = pd.read_csv('../data/weather_2012.csv')
con = sqlite3.connect("../data/test_db.sqlite")
con.execute("DROP TABLE IF EXISTS weather_2012")
pd.io.sql.write_frame(weather_df, "weather_2012", con)
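(A note if you're on a newer version of pandas: write_frame has since been removed, and the equivalent is the DataFrame.to_sql method. A sketch of the same operation:)
In [ ]:
# In newer pandas, use DataFrame.to_sql instead of write_frame.
# if_exists="replace" drops and recreates the table, like our DROP TABLE above.
weather_df.to_sql("weather_2012", con, if_exists="replace", index=False)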
We can now read from the weather_2012 table in test_db.sqlite, and we see that we get the same data back:
In [6]:
con = sqlite3.connect("../data/test_db.sqlite")
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con)
df
Out[6]:
The nice thing about having your data in a database is that you can do arbitrary SQL queries. This is especially handy if you're more comfortable with SQL than with pandas. Here's an example of sorting by the Weather column:
In [7]:
con = sqlite3.connect("../data/test_db.sqlite")
df = pd.read_sql("SELECT * from weather_2012 ORDER BY Weather LIMIT 3", con)
df
Out[7]:
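And you can go further than sorting -- any SQL the database understands is fair game. For instance, here's a quick sketch of counting how many rows there are for each weather type (the Weather column is the same one we just sorted by):
In [ ]:
# Count the number of rows for each weather type.
df = pd.read_sql(
    "SELECT Weather, COUNT(*) AS n FROM weather_2012 GROUP BY Weather", con)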
If you have a PostgreSQL database or MySQL database, reading from it works exactly the same way as reading from a SQLite database. You create a connection using psycopg2.connect() or MySQLdb.connect(), and then use
pd.read_sql("SELECT whatever from your_table", con)
To connect to a MySQL database:
In [ ]:
import MySQLdb
con = MySQLdb.connect(host="localhost", db="test")
To connect to a PostgreSQL database:
In [ ]:
import psycopg2
con = psycopg2.connect(host="localhost")
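Once you have the connection, reading works exactly like before. A sketch -- the table name here is hypothetical, so use one from your own database:
In [ ]:
# Works the same as the SQLite examples above.
# "your_table" is a placeholder -- substitute a real table name.
df = pd.read_sql("SELECT * FROM your_table LIMIT 3", con)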